The aim of this project is to use a sample body of text - extracted from blogs, twitter and the news - to create an algorithm able to predict the next word of a given input phrase. The ultimate goal is to implement this prediction algorithm in a Shiny app. As such, in the interests of usability, the memory requirements and runtime of the prediction algorithm will be key considerations during its development.
This report covers several areas of the predictive algorithm's development. First, the prediction methodology and setup are explained. Subsequently, the algorithm development process is described - this consists of the pre-processing carried out on the input texts (the "corpus"), exploratory analysis, modelling, benchmark testing and initial output results. Finally, there is a discussion of how the predictive algorithm could be implemented in the Shiny app, and what potential improvements could be made.
In this report, two different algorithms are investigated:
To generate the training corpus (generated from the news, blog and twitter sources) from which the algorithms extract predictions, the text is word tokenised under the following assumptions:
Both algorithms have the same initial setup process (1-3) - and subsequently diverge:
[Markov Chain Model - w/ Stupid Back-Off]
[Katz Back-Off Model]
See here for how conditional probabilities are calculated for an n=2 Katz Back-Off Model
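As a minimal illustration of the calculation (using toy counts, not the Capstone data), the discounted probability of each observed continuation and the left-over probability mass for an n=2 Katz Back-Off model can be computed as:

```r
# Toy bigram Katz Back-Off sketch - illustrative counts only, discount d = 0.5
d <- 0.5
# observed counts: unigram "thank" and bigrams beginning with "thank"
c_thank   <- 10
c_bigrams <- c(thank_you = 6, thank_god = 2)

# discounted probabilities of the observed continuations
p_obs <- (c_bigrams - d) / c_thank     # (6-0.5)/10 = 0.55, (2-0.5)/10 = 0.15

# left-over probability mass, redistributed to unseen continuations
alpha <- 1 - sum(p_obs)                # 0.30
```

The discount `d = 0.5` matches the `nDisc` value used in the model code later in this report; the word counts are invented for the example.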
The word prediction algorithm utilises text sourced from blogs, the news and twitter. These datasets were provided as part of the Capstone dataset. In the present analysis, only 10% of the lines were imported from each dataset.
# proportion of lines to read in
readprop <- 0.1
#document details
lang = "en_US"; source_list = c("news","blogs","twitter")
# Output file location
filepath <- paste0("data/",lang,"/modelling/");
if(!dir.exists(filepath)) dir.create(file.path(filepath), recursive = TRUE)
#Sample Texts
all_texts <- data.frame(matrix(ncol=5,nrow=3,
dimnames=list(NULL, c("text", "source", "lines", "prop", "size"))))
for (i in 1:3){
# data location
datdir <- paste0("data/",lang,"/",lang,".",source_list[i],".txt")
# Count lines (countLines() is from the R.utils package),
# then read in the first readprop (10%) proportion of them
all_texts$lines[i] <- sapply(datdir,countLines)
all_texts$text[i] <- paste(readLines(datdir,
round(readprop*all_texts$lines[i]),
encoding="UTF-8"),collapse=" ")
all_texts$source[i] <- source_list[i]
all_texts$size[i] <- paste0(round(file.info(datdir)$size / 1024.0 ^ 2,2),"MB")
all_texts$prop[i] <- readprop
}
The metadata of each text file was also extracted:
## source lines size
## 1 news 1010242 196.28MB
## 2 blogs 899288 200.42MB
## 3 twitter 2360148 159.36MB
The three texts were combined into a single corpus and tagged with their source (e.g. "twitter") in the metadata. The corpus was then tokenised by word, with punctuation, symbols, numbers, URLs and separators removed.
# Create corpus
doc.corpus <- corpus(all_texts, text_field = "text")
docnames(doc.corpus) <- docvars(doc.corpus, "source")
# Tokenise corpus into words
doc.tokens <- tokens(doc.corpus,
what = "fasterword",
remove_punct = TRUE,
remove_symbols = TRUE,
remove_numbers = TRUE,
remove_url = TRUE,
remove_separators = TRUE)
As the task was to create a word prediction algorithm that excluded profanities, it was then necessary to standardise the word tokens by converting them to lower case, and then use the quanteda package's tokens_select function to restrict the tokenised corpus to "clean" words from the English dictionary.
doc.tokens.lo <- tokens_tolower(doc.tokens)
if(!file.exists("data/en_US/profanity_list.csv")){
url <- "https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en"
download.file(url, "data/en_US/profanity_list.csv")
}
profanity_list <- read.csv("data/en_US/profanity_list.csv", header = FALSE, sep = ",")
#remove profanities
doc.tokens.lo.rm <- tokens_select(doc.tokens.lo, profanity_list$V1, selection = "remove")
#Keep only english words
doc.tokens.clean <- tokens_select(doc.tokens.lo.rm, GradyAugmented, selection = "keep")
The processed corpus was then converted into a document-feature matrix (dfm) containing n-grams for n=1,2,3,4. Separate frequency tables were then created for the complete corpus and the individual source texts.
# Create n-grams - unigram to quad-gram - and convert to a DFM
doc.dfm <- dfm(tokens_ngrams(doc.tokens.clean, n=1:4))
# Create frequency tables - by source
freq.table_by_source <- tibble(textstat_frequency(doc.dfm, groups = source)) %>%
dplyr::mutate(ngram = str_count(feature,"_")+1) %>%
select(feature,frequency,group,ngram)
# - and for the combined corpus
freq.table <- tibble(textstat_frequency(doc.dfm)) %>%
dplyr::mutate(ngram = str_count(feature,"_")+1) %>%
select(feature,frequency,ngram)
A summary of the n-gram frequency tables is given below for the combined text sources:
| ngram | Rows |
|---|---|
| 1 | 48733 |
| 2 | 1684370 |
| 3 | 4920678 |
| 4 | 6864029 |
and each individual text source:
| ngram | group | Rows |
|---|---|---|
| 1 | blogs | 40494 |
| 1 | news | 34714 |
| 1 | twitter | 30116 |
| 2 | blogs | 829852 |
| 2 | news | 685019 |
| 2 | twitter | 647248 |
| 3 | blogs | 2161041 |
| 3 | news | 1606232 |
| 3 | twitter | 1655109 |
| 4 | blogs | 2884951 |
| 4 | news | 2002044 |
| 4 | twitter | 2160069 |
From these tables, we note that blogs have the highest number of unique n-grams across all orders; however, the table below shows that the blogs text also contains the most characters:
| group | characters |
|---|---|
| blogs | 87483880 |
| news | 67184048 |
| twitter | 62844900 |
In this section, the n-grams in the text corpus are explored for each source text.
Figure 1: Word cloud of 100 most frequent unigrams
As expected, typical “stopwords” are the most common uni-grams across all source texts.
Figure 2: Uni-gram Frequency Plots; (left) unigram frequencies for top 100 most frequent uni-grams, (right) cumulative frequency density up to 20th percentile of uni-grams.
Additionally, the frequency plots show that the source texts have similar uni-gram distributions: 5% of uni-grams account for nearly 90% of the total uni-gram frequency in all source texts except news, where the figure is closer to 80%.
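The cumulative frequency density shown in these plots can be derived directly from a frequency table. A minimal base-R sketch, using made-up counts rather than the corpus itself:

```r
# Toy unigram frequencies, sorted from most to least frequent
freq <- sort(c(the = 500, to = 300, a = 250, cat = 5, dog = 3, axolotl = 1),
             decreasing = TRUE)

# cumulative share of total token count covered by the top-ranked words
cum_density <- cumsum(freq) / sum(freq)

# percentile rank of each word in the vocabulary
percentile <- seq_along(freq) / length(freq)

# share of all tokens covered by the top 50% of unigrams
cum_density[percentile <= 0.5]
```

Plotting `cum_density` against `percentile` reproduces the right-hand panels of the frequency figures.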
Figure 3: Word cloud of 100 most frequent bi-grams
While blogs and news show similar bi-grams, twitter appears to have diverged. This perhaps reflects the unique speech patterns observed on the Twitter platform.
Figure 4: Bi-gram Frequency Plots; (left) bi-gram frequencies for top 100 most frequent bi-grams, (right) cumulative frequency density up to 20th percentile of bi-grams.
The bi-gram frequency plots reflect this divergence, as the most frequent Twitter bi-grams have notably fewer instances than the top-ranked bi-grams in the blogs and news texts. However, after this initial peak, the distribution of Twitter bi-grams follows the blogs distribution quite closely.
Figure 5: Word cloud of 100 most frequent tri-grams.
Again, blogs and news texts show very similar tri-grams, whereas twitter appears to have a completely different selection of most frequent tri-grams.
Figure 6: Tri-gram Frequency Plots; (left) tri-gram frequencies for top 100 most frequent tri-grams, (right) cumulative frequency density up to 20th percentile of tri-grams.
The frequency of tri-grams is notably lower than that of uni-grams and bi-grams. The news frequency distribution remains distinct from blogs and twitter - its higher variation in text patterns (perhaps signifying more complex communication) results in a lower curve.
Figure 7: Word cloud of 100 most frequent quad-grams
Blogs and news retain similar quad-grams. Twitter continues to have a notably distinct grouping.
Figure 8: Quad-gram Frequency Plots; (left) quad-gram frequencies for top 100 most frequent quad-grams, (right) cumulative frequency density up to 20th percentile of quad-grams.
The number of instances of the most frequent quad-grams is significantly lower than the lower order n-grams across all source texts. The linear relationship between the CDF and percentile (just before the 5th percentile) corresponds to quad-grams with a single instance. This also suggests that going to higher order n-grams (i.e. penta-grams) is unlikely to improve the performance of the model algorithm (for the given training dataset).
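This link between singleton quad-grams and the linear CDF segment can be checked directly: quad-grams with a frequency of one each add an identical increment to the cumulative density, so their portion of the curve is a straight line. A hedged sketch on toy counts (not the actual quad-gram table):

```r
# Toy quad-gram frequency table: a few repeated quad-grams plus many singletons
freq <- sort(c(rep(5, 2), rep(2, 3), rep(1, 95)), decreasing = TRUE)

# cumulative frequency density over the ranked quad-grams
cdf <- cumsum(freq) / sum(freq)

# among the singletons, successive CDF increments are constant (1 / total count),
# which appears as a linear segment in the plot
increments <- diff(cdf[freq == 1])
all(abs(increments - increments[1]) < 1e-12)  # TRUE
```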
The word prediction algorithms are tested with a set of three text inputs:
for four different training corpora:
“Thank you for”
Markov Chain
results_list_mc[[1]]
## word.Comb word.blogs word.news word.twitter
## 1 the your your the
## 2 your the letting your
## 3 following making making following
## 4 all all the all
## 5 being being being sharing
Katz Back-Off
results_list_kbo[[1]]
## word.Comb prob.Comb word.blogs prob.blogs word.news prob.news word.twitter prob.twitter
## 1 the 0.31 your 0.16 your 0.28 the 0.35
## 2 your 0.10 the 0.08 letting 0.09 your 0.09
## 3 following 0.05 making 0.06 a 0.05 following 0.06
## 4 all 0.04 all 0.05 making 0.03 all 0.04
## 5 being 0.04 being 0.03 the 0.03 sharing 0.03
“Follow me”
Markov Chain
results_list_mc[[2]]
## word.Comb word.blogs word.news word.twitter
## 1 i on on i
## 2 and and today and
## 3 on to <NA> back
## 4 back at <NA> please
## 5 please out <NA> on
Katz Back-Off
results_list_kbo[[2]]
## word.Comb prob.Comb word.blogs prob.blogs word.news prob.news word.twitter prob.twitter
## 1 i 0.11 on 0.67 on 0.25 i 0.11
## 2 and 0.08 and 0.07 today 0.25 and 0.08
## 3 on 0.08 to 0.02 to 0.07 back 0.08
## 4 back 0.08 at 0.02 and 0.03 please 0.06
## 5 please 0.06 out 0.02 that 0.03 on 0.06
“and a case of”
Markov Chain
results_list_mc[[3]]
## word.Comb word.blogs word.news word.twitter
## 1 the the trying the
## 2 water this in water
## 3 this shout jan red
## 4 trying underway international plagiarism
## 5 in there beer carpal
Katz Back-Off
results_list_kbo[[3]]
## word.Comb prob.Comb word.blogs prob.blogs word.news prob.news word.twitter prob.twitter
## 1 the 0.16 the 0.16 the 0.09 the 0.27
## 2 water 0.03 this 0.07 a 0.08 water 0.12
## 3 this 0.03 shout 0.02 trying 0.04 red 0.04
## 4 and 0.01 underway 0.02 in 0.04 plagiarism 0.04
## 5 an 0.01 there 0.02 jan 0.04 carpal 0.04
Model Comparisons
The output of the Markov Chain (w/ Stupid Back-Off) and Katz Back-Off models is almost identical for the top three predicted words. From the Katz Back-Off algorithm we also find that the conditional probability of the top word is usually significantly higher than that of the remaining four words. However, the limitations of the Markov Chain model are also apparent: for the text input "Follow me", the news-trained model returns only two word predictions, whereas the Katz Back-Off model returns five by assigning probability mass to unobserved n-grams.
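The difference in candidate counts can be reproduced on a toy table: an observed-continuation lookup can only return as many words as were actually seen after the input, while a back-off step pads the list from the unigram table. A base-R sketch using hypothetical counts (not the report's trained tables):

```r
# Toy tables: trigrams observed after "follow me", plus a unigram table
obs <- c(follow_me_on = 12, follow_me_back = 4)
uni <- c(the = 900, i = 400, on = 300, back = 100, please = 80)

# observed-only lookup (the Markov Chain / Stupid Back-Off path):
# only two candidates exist, so a request for five comes back short
markov_pred <- sub("^follow_me_", "", names(sort(obs, decreasing = TRUE)))

# a back-off step pads the list with unseen unigrams, restoring five candidates
unseen   <- uni[!(names(uni) %in% markov_pred)]
kbo_pred <- c(markov_pred, names(sort(unseen, decreasing = TRUE)))
```

Here `markov_pred` holds two words while `kbo_pred` holds five, mirroring the behaviour observed above.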
Source Text Comparisons
The top five word predictions generated from the combined corpus generally differ from those generated from a single source text. This will be a consideration for the final algorithm, as it is well known that speech patterns differ between communication platforms (e.g. the phrase "Follow me back" is commonly used on Twitter). Therefore, a word prediction tool may benefit (in terms of usability) from a corpus comprised of text sourced specifically from the target platform.
With respect to app implementation, the following issues will be addressed:
# =============================
# Dynamic Katz Back-Off Model
# =============================
ngram_kbo_DF <- function(text.input,freq.table,nwords,len=3){
text.input <- iconv(text.input, from = "UTF-8", to = "ASCII", sub = "")
# NB: the remove_twitter and remove_hyphens arguments are defunct in quanteda >= 2
# and are dropped here; the default split_hyphens = FALSE keeps hyphenated words intact
text.token <- tokens(char_tolower(text.input),
remove_punct = TRUE,
remove_symbols = TRUE,
remove_separators = TRUE,
remove_numbers = TRUE)
text.token <- as.vector(unlist(text.token))
text.n <- length(text.token)
nDisc = 0.5
nloDisc = 0.5
# 1.a Split frequency table by ngram
#select maximum ngram order to inspect in frequency table
ngram_max <- min(len+1,max(freq.table$ngram))
ft_list <- list()
#create list of ngram frequency tables
for (j in 1:ngram_max){
ft_list[[j]] <- freq.table %>% filter(ngram == j)
}
rm(freq.table)
# Extract observed n-gram and discounted probability
i=0
len <- min(text.n,len)
if(len>0){i = text.n-len}
while (0 < text.n - i){
#convert text input into ngram format
text.ngram <- paste0(tail(text.token,text.n-i),collapse="_")
text.ngram.lo <- paste0(tail(text.token,text.n-i-1),collapse="_")
message("Searching for ngram... ", text.ngram)
# 1b. Create dynamic ngram tables
freq.table_lo_p1 <- ft_list[[max(text.n - i + 1,1)]]
freq.table_lo <- ft_list[[max(text.n - i,1)]]
freq.table_lo_m1 <- ft_list[[max(text.n - i - 1,1)]]
freq.table_uni <- ft_list[[1]]
ngram_match <- freq.table_lo %>% filter(feature == text.ngram)
ngramlo_match <- freq.table_lo_m1 %>% filter(feature == text.ngram.lo)
# =============================
# If word exists in corpus - Katz Back-Off Model
# =============================
if(nrow(ngram_match)>0){
# 2. find observed ngram predictions (i.e. (n+1)-gram) and compute discounted probability
ngram_Obs_pred <- freq.table_lo_p1 %>% filter(grepl(paste0("^",text.ngram,"_"),feature)) %>%
dplyr::mutate(prob.pred = ((frequency - nDisc) / ngram_match$frequency),
next.word = gsub(paste0("^",text.ngram,"_"),"",feature),
feature.pred = feature)
# =============================
# If text input is a single word - Unigram Katz Back-Off Model
# =============================
if (nrow(ngram_Obs_pred)>0 && text.n-i == 1){
ngram_pred <- unigram_KBO(freq.table_uni,text.token,text.ngram,ngram_Obs_pred)
return(ngram_pred[1:min(nwords,nrow(ngram_pred)), c("next.word","prob.pred")])
# =============================
# Dynamic Katz Back-Off Model
# =============================
} else if (nrow(ngram_Obs_pred)>0 && text.n-i > 1) {
message("Running Dynamic Katz Back-Off Model")
# 3. find unobserved n-gram predictions i.e. possible (n+1) word of n-gram input
#Need to extract unigrams not in list of predicted next word from observed ngram predictions
ngram_unObs_tail <- freq.table_uni %>% filter(!(feature %in% ngram_Obs_pred$next.word))
# 4. Calculate discounted probability mass - weighting for unobserved n-grams
# Computed from observed n-grams that form the tail (i.e. n-1) of the n-gram input
prob_mass <- freq.table_lo %>% filter(grepl(paste0("^",text.ngram.lo,"_"),feature)) %>%
summarise(alpha = 1 - (sum(frequency - nloDisc) / ngramlo_match$frequency))
# 5. Calculate backed-off probabilities for n-grams
# 5a. Generated backed-off n-grams using the unobserved (n+1)-gram tails - see (3)
ngram_Bo <-ngram_unObs_tail %>%
dplyr::mutate(feature.BO.ngram = paste(text.ngram.lo, feature, sep = "_"))
# 5b. Extract observed frequencies of backed-off n-grams
ngram_Bo_Obs <- freq.table_lo %>% filter(feature %in% ngram_Bo$feature.BO.ngram,
str_count(feature,"_") > 0)
# 5c. Identify unobserved backed-off n-grams - using 5b
ngram_Bo_unObs <- ngram_Bo %>% filter(!(feature.BO.ngram %in% ngram_Bo_Obs$feature)) %>%
dplyr::mutate(feature = feature.BO.ngram) %>% select(-feature.BO.ngram)
# 5d. Generate probabilities of observed backed-off n-grams
ngram_Bo_Obs <- ngram_Bo_Obs %>%
dplyr::mutate(prob = ifelse(frequency>0,
(frequency - nloDisc) / ngramlo_match$frequency,0))
# 5e. Generate probabilities of unobserved backed-off n-grams
ngram_Bo_unObs <-ngram_Bo_unObs %>%
dplyr::mutate(prob = prob_mass$alpha * frequency / sum(frequency) )
# Combine observed and unobserved n-grams
ngram_Bo <- rbind(ngram_Bo_Obs,ngram_Bo_unObs)
# 6. Calculate (n+1)-gram probability discount mass - using observed (n+1)-gram probabilities
prob_mass <- prob_mass %>% dplyr::mutate(alpha2 = 1 - sum(ngram_Obs_pred$prob.pred))
# 7. Calculate unobserved backed-off (n+1)-gram probabilities
ngram_Bo_unObs_pred <- ngram_Bo %>%
dplyr::mutate(feature.pred = paste(text.token[1],feature,sep="_"),
prob.pred = prob_mass$alpha2*prob / sum(prob))
# 8. Select prediction with highest probability
# Clean data frames
ngram_Bo_unObs_pred <- ngram_Bo_unObs_pred %>%
dplyr::mutate(next.word = sub(".*_", "",feature.pred)) %>%
select(feature,frequency,feature.pred,prob.pred,next.word)
ngram_pred <- ngram_Obs_pred %>%
select(feature,frequency,feature.pred,prob.pred,next.word) %>%
rbind(ngram_Bo_unObs_pred) %>% arrange(desc(prob.pred))
return(ngram_pred[1:min(nwords,nrow(ngram_pred)), c("next.word","prob.pred")])
}
}
i = i + 1
message(text.n - i)
}
# =============================
# If n-gram does not exist in corpus - take most frequent words
# =============================
message("Running Unigram Model")
ngram_pred <- unigram_pred(freq.table_uni,text.token)
return(ngram_pred[1:min(nwords,nrow(ngram_pred)), c("next.word","prob.pred")])
}
# =============================
# Unigram Katz Back-Off Model
# =============================
unigram_KBO <- function(freq.table,text.input,text.ngram,ngram_Obs_pred){
prob_mass<- data.frame(alpha= 1 - sum(ngram_Obs_pred$prob.pred))
ngram_Bo_unObs <- freq.table %>% filter(ngram == 1, !(feature %in% ngram_Obs_pred$next.word)) %>%
mutate(prob.pred = prob_mass$alpha * frequency / sum(frequency),
next.word = feature, feature.pred = paste(feature,text.ngram,sep="_")) %>%
select(feature,frequency,feature.pred,prob.pred,next.word)
ngram_pred <- ngram_Obs_pred %>% select(feature,frequency,feature.pred,prob.pred,next.word) %>%
rbind(ngram_Bo_unObs) %>% arrange(desc(prob.pred))
return(ngram_pred)
}
# =============================
# Generate prediction from most frequent words in corpus
# =============================
unigram_pred <- function(freq.table,text.input){
ngram_pred <- freq.table %>% filter(ngram == 1) %>% mutate(prob.pred = frequency / sum(frequency),
feature.pred = paste(text.input,feature,sep="_"),
next.word = feature) %>%
select(feature,frequency,feature.pred,prob.pred,next.word) %>% arrange(desc(prob.pred))
return(ngram_pred)
}
# =============================
# Markov Chain Model
# =============================
ngram_mc_DF <- function(text.input,freq.table,nwords,len=3){
text.input <- iconv(text.input, from = "UTF-8", to = "ASCII", sub = "")
# NB: the remove_twitter and remove_hyphens arguments are defunct in quanteda >= 2
# and are dropped here; the default split_hyphens = FALSE keeps hyphenated words intact
text.token <- tokens(char_tolower(text.input),
remove_punct = TRUE,
remove_symbols = TRUE,
remove_separators = TRUE,
remove_numbers = TRUE)
text.token <- as.vector(unlist(text.token))
text.n <- length(text.token)
# 1.a Split frequency table by ngram
#select maximum ngram order to inspect in frequency table
ngram_max <- min(len+1,max(freq.table$ngram))
ft_list <- list()
#create list of ngram frequency tables
for (j in 1:ngram_max){
ft_list[[j]] <- freq.table %>% filter(ngram == j)
}
rm(freq.table)
# Extract observed n-gram and discounted probability
i=0
len <- min(text.n,len)
if(len>0){i = text.n-len}
while (0 < text.n - i){
#convert text input into ngram format
text.ngram <- paste0(tail(text.token,text.n-i),collapse="_")
message("Searching for ngram... ", text.ngram)
ngram_match <- ft_list[[max(text.n - i + 1,1)]] %>% filter(grepl(paste0("^",text.ngram,"_"),feature))
# =============================
# If word exists in corpus - Markov Chain Model
# =============================
if(nrow(ngram_match)>0){
ngram_pred <- ngram_match %>% mutate(next.word = gsub(paste0("^",text.ngram,"_"),"",feature)) %>%
arrange(desc(frequency))
return(ngram_pred[1:min(nwords,nrow(ngram_pred)), "next.word"])
}
i = i + 1
}
# =============================
# If n-gram does not exist in corpus - take most frequent words
# =============================
message("Running Unigram Model")
# ft_list[[1]] holds the unigram table (freq.table_uni is not defined in this function)
ngram_pred <- unigram_pred(ft_list[[1]],text.token)
return(ngram_pred[1:min(nwords,nrow(ngram_pred)), "next.word"])
}